25 research outputs found
Question Answering from Unstructured Text by Retrieval and Comprehension
Open domain Question Answering (QA) systems must interact with external
knowledge sources, such as web pages, to find relevant information. Information
sources like Wikipedia, however, are not well structured and are difficult to
utilize in comparison with Knowledge Bases (KBs). In this work we present a
two-step approach to question answering from unstructured text, consisting of a
retrieval step and a comprehension step. For comprehension, we present an RNN
based attention model with a novel mixture mechanism for selecting answers from
either retrieved articles or a fixed vocabulary. For retrieval we introduce a
hand-crafted model and a neural model for ranking relevant articles. We achieve
state-of-the-art performance on the WikiMovies dataset, reducing the error by
40%. Our experimental results further demonstrate the importance of each of the
introduced components.
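A minimal sketch of the answer-selection mixture described above, assuming a pointer-style copy distribution over the retrieved article and a softmax over a fixed vocabulary; the module names, shapes, and gating form are illustrative assumptions rather than the paper's exact implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AnswerMixture(nn.Module):
        """Mix a copy distribution over article tokens with a fixed-vocabulary softmax."""

        def __init__(self, hidden_size, vocab_size):
            super().__init__()
            self.vocab_proj = nn.Linear(hidden_size, vocab_size)  # fixed-vocabulary scores
            self.gate = nn.Linear(hidden_size, 1)                 # mixture coefficient

        def forward(self, state, doc_states, doc_token_ids):
            # state: (batch, hidden), a summary of the question and retrieved article
            # doc_states: (batch, doc_len, hidden), RNN states over the article tokens
            # doc_token_ids: (batch, doc_len), vocabulary ids of those tokens
            attn = torch.softmax(
                torch.bmm(doc_states, state.unsqueeze(2)).squeeze(2), dim=1)
            p_vocab = F.softmax(self.vocab_proj(state), dim=1)
            g = torch.sigmoid(self.gate(state))          # probability of copying
            p_copy = torch.zeros_like(p_vocab)
            p_copy.scatter_add_(1, doc_token_ids, attn)  # attention mass onto vocab ids
            return g * p_copy + (1 - g) * p_vocab

The gate lets the model choose, per question, between copying a token from the retrieved article and emitting a word from the fixed vocabulary, which is one plausible way to realise the mixture described above.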
Quasar: Datasets for Question Answering by Search and Reading
We present two new large-scale datasets aimed at evaluating systems designed
to comprehend a natural language query and extract its answer from a large
corpus of text. The Quasar-S dataset consists of 37000 cloze-style
(fill-in-the-gap) queries constructed from definitions of software entity tags
on the popular website Stack Overflow. The posts and comments on the website
serve as the background corpus for answering the cloze questions. The Quasar-T
dataset consists of 43000 open-domain trivia questions and their answers
obtained from various internet sources. ClueWeb09 serves as the background
corpus for extracting these answers. We pose these datasets as a challenge for
two related subtasks of factoid Question Answering: (1) searching for relevant
pieces of text that include the correct answer to a query, and (2) reading the
retrieved text to answer the query. We also describe a retrieval system for
extracting relevant sentences and documents from the corpus given a query, and
include these in the release for researchers wishing to only focus on (2). We
evaluate several baselines on both datasets, ranging from simple heuristics to
powerful neural models, and show that these lag behind human performance by
16.4% and 32.1% for Quasar-S and -T respectively. The datasets are available at
https://github.com/bdhingra/quasar
Combating Adversarial Misspellings with Robust Word Recognition
To combat adversarial spelling mistakes, we propose placing a word
recognition model in front of the downstream classifier. Our word recognition
models build upon the RNN semi-character architecture, introducing several new
backoff strategies for handling rare and unseen words. Trained to recognize
words corrupted by random adds, drops, swaps, and keyboard mistakes, our method
achieves 32% relative (and 3.3% absolute) error reduction over the vanilla
semi-character model. Notably, our pipeline confers robustness on the
downstream classifier, outperforming both adversarial training and
off-the-shelf spell checkers. Against a BERT model fine-tuned for sentiment
analysis, a single adversarially-chosen character attack lowers accuracy from
90.3% to 45.8%. Our defense restores accuracy to 75%. Surprisingly, better word
recognition does not always entail greater robustness. Our analysis reveals
that robustness also depends upon a quantity that we denote the sensitivity.
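A minimal sketch of the semi-character word representation that the word recognition models build on: the one-hot encoding of the first character, a bag-of-characters count of the interior, and the one-hot encoding of the last character, so words corrupted by internal swaps map to the same vector. The alphabet and helper names are illustrative assumptions.

    import string
    import numpy as np

    ALPHABET = string.ascii_lowercase
    CHAR_IDX = {c: i for i, c in enumerate(ALPHABET)}

    def semi_character_vector(word: str) -> np.ndarray:
        """Concatenate first-char one-hot, interior char counts, last-char one-hot."""
        word = word.lower()
        n = len(ALPHABET)
        first, middle, last = np.zeros(n), np.zeros(n), np.zeros(n)
        if word:
            if word[0] in CHAR_IDX:
                first[CHAR_IDX[word[0]]] = 1.0
            if word[-1] in CHAR_IDX:
                last[CHAR_IDX[word[-1]]] = 1.0
            for ch in word[1:-1]:
                if ch in CHAR_IDX:
                    middle[CHAR_IDX[ch]] += 1.0
        return np.concatenate([first, middle, last])

    # an internal swap leaves the representation unchanged
    assert np.allclose(semi_character_vector("adversarial"),
                       semi_character_vector("avdersarial"))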
A Comparative Study of Word Embeddings for Reading Comprehension
The focus of past machine learning research for Reading Comprehension tasks
has been primarily on the design of novel deep learning architectures. Here we
show that seemingly minor choices made on (1) the use of pre-trained word
embeddings, and (2) the representation of out-of-vocabulary tokens at test
time, can turn out to have a larger impact than architectural choices on the
final performance. We systematically explore several options for these choices,
and provide recommendations to researchers working in this area.
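A small sketch of one of the test-time choices the study examines, assuming the strategy of giving each out-of-vocabulary token its own random but fixed vector rather than a single shared UNK embedding; the function name and dimensions are hypothetical.

    import numpy as np

    def lookup(token, pretrained, oov_cache, dim=100, rng=np.random.default_rng(0)):
        """Return the pre-trained vector if available, otherwise a cached random one."""
        if token in pretrained:
            return pretrained[token]
        if token not in oov_cache:
            oov_cache[token] = rng.normal(scale=0.1, size=dim)
        return oov_cache[token]

    pretrained = {"question": np.ones(100)}   # stand-in for GloVe/word2vec vectors
    oov_vectors = {}
    v1 = lookup("xylograph", pretrained, oov_vectors)
    v2 = lookup("xylograph", pretrained, oov_vectors)
    assert np.array_equal(v1, v2)             # an unseen word keeps the same vector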
Probing Biomedical Embeddings from Language Models
Contextualized word embeddings derived from pre-trained language models (LMs)
show significant improvements on downstream NLP tasks. Pre-training on
domain-specific corpora, such as biomedical articles, further improves their
performance. In this paper, we conduct probing experiments to determine what
additional information is carried intrinsically by the in-domain trained
contextualized embeddings. For this we use the pre-trained LMs as fixed feature
extractors and restrict the downstream task models to not have additional
sequence modeling layers. We compare BERT, ELMo, BioBERT and BioELMo, a
biomedical version of ELMo trained on 10M PubMed abstracts. Surprisingly, while
fine-tuned BioBERT is better than BioELMo in biomedical NER and NLI tasks, as a
fixed feature extractor BioELMo outperforms BioBERT in our probing tasks. We
use visualization and nearest neighbor analysis to show that better encoding of
entity-type and relational information leads to this superiority.Comment: NAACL-HLT 2019 Workshop on Evaluating Vector Space Representations
for NLP (RepEval
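A hedged sketch of the fixed-feature-extractor probing setup described above: the pre-trained LM is frozen and the downstream model is restricted to a single linear layer, so any signal must already be present in the contextualized embeddings. The checkpoint name, label count, and layer choice are illustrative assumptions.

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad = False            # the LM is a frozen feature extractor

    probe = nn.Linear(encoder.config.hidden_size, 5)   # e.g. 5 entity types

    def probe_logits(sentence: str) -> torch.Tensor:
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, hidden)
        return probe(hidden)                               # per-token entity-type logits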
Linguistic Knowledge as Memory for Recurrent Neural Networks
Training recurrent neural networks to model long-term dependencies is
difficult. Hence, we propose to use external linguistic knowledge as an
explicit signal to inform the model which memories it should utilize.
Specifically, external knowledge is used to augment a sequence with typed edges
between arbitrarily distant elements, and the resulting graph is decomposed
into directed acyclic subgraphs. We introduce a model that encodes such graphs
as explicit memory in recurrent neural networks, and use it to model
coreference relations in text. We apply our model to several text comprehension
tasks and achieve new state-of-the-art results on all considered benchmarks,
including CNN, bAbI, and LAMBADA. On the bAbI QA tasks, our model solves 15 out
of the 20 tasks with only 1000 training examples per task. Analysis of the
learned representations further demonstrates the ability of our model to encode
fine-grained entity information across a document.
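A minimal sketch of the kind of input augmentation described above, assuming coreference clusters are available: each mention is linked to the previous mention in its cluster by a typed edge, alongside the ordinary sequential edges, so the recurrent model is told explicitly which distant memory a token relates to. The cluster format and edge encoding are illustrative assumptions, not the paper's exact data structures.

    def coreference_edges(num_tokens, clusters):
        """clusters: list of mention lists, each mention a (start, end) token span.
        Returns (source, target, type) edges."""
        edges = [(i, i + 1, "next") for i in range(num_tokens - 1)]   # sequential edges
        for cluster in clusters:
            spans = sorted(cluster)
            for (prev_start, _), (cur_start, _) in zip(spans, spans[1:]):
                edges.append((prev_start, cur_start, "coref"))        # typed long-range edge
        return edges

    # "Mary went home . She was tired ." with "Mary" (token 0) and "She" (token 4)
    print(coreference_edges(8, [[(0, 0), (4, 4)]]))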
Simple and Effective Semi-Supervised Question Answering
Recent success of deep learning models for the task of extractive Question
Answering (QA) hinges on the availability of large annotated corpora.
However, large domain-specific annotated corpora are limited and expensive to
construct. In this work, we envision a system where the end user specifies a
set of base documents and only a few labeled examples. Our system exploits the
document structure to create cloze-style questions from these base documents;
pre-trains a powerful neural network on the cloze-style questions; and further
fine-tunes the model on the labeled examples. We evaluate our proposed system
across three diverse datasets from different domains, and find it to be highly
effective with very little labeled data. We attain more than 50% F1 score on
SQuAD and TriviaQA with fewer than a thousand labeled examples. We are also
releasing a set of 3.2M cloze-style questions for practitioners to use while
building QA systems.
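A hedged sketch of the cloze-question idea: blank a candidate answer span out of a sentence from the user's own documents and treat the surrounding text as the question. The span-selection heuristic below (a non-initial capitalised token) is purely an illustrative assumption; the released questions follow the paper's own procedure.

    import re

    def make_cloze(sentence, blank="@placeholder"):
        """Return (cloze_question, answer) pairs for each candidate answer token."""
        tokens = sentence.split()
        pairs = []
        for i, tok in enumerate(tokens):
            if i > 0 and re.match(r"^[A-Z][a-z]+$", tok):
                question = " ".join(tokens[:i] + [blank] + tokens[i + 1:])
                pairs.append((question, tok))
        return pairs

    print(make_cloze("The Eiffel Tower was completed in 1889 in Paris"))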
Text Generation with Exemplar-based Adaptive Decoding
We propose a novel conditioned text generation model. It draws inspiration
from traditional template-based text generation techniques, where the source
provides the content (i.e., what to say), and the template influences how to
say it. Building on the successful encoder-decoder paradigm, it first encodes
the content representation from the given input text; to produce the output, it
retrieves exemplar text from the training data as "soft templates," which are
then used to construct an exemplar-specific decoder. We evaluate the proposed
model on abstractive text summarization and data-to-text generation. Empirical
results show that this model achieves strong performance and outperforms
comparable baselines.
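A small sketch of the exemplar-retrieval step, assuming plain TF-IDF cosine similarity between the new source and the training sources as the retrieval criterion; the model's actual retrieval and its exemplar-specific decoder are more involved, so this only illustrates where a "soft template" could come from. The toy strings are invented for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    train_sources = ["the storm closed the airport for two days",
                     "heavy rain flooded the city center overnight"]
    train_targets = ["airport shut by storm", "overnight rain floods city"]

    vectorizer = TfidfVectorizer().fit(train_sources)
    source_matrix = vectorizer.transform(train_sources)

    def retrieve_exemplar(new_source: str) -> str:
        """Return the training target whose source is most similar to the input."""
        sims = cosine_similarity(vectorizer.transform([new_source]), source_matrix)
        return train_targets[sims.argmax()]

    print(retrieve_exemplar("a winter storm has shut down the regional airport"))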
Gated-Attention Readers for Text Comprehension
In this paper we study the problem of answering cloze-style questions over
documents. Our model, the Gated-Attention (GA) Reader, integrates a multi-hop
architecture with a novel attention mechanism, which is based on multiplicative
interactions between the query embedding and the intermediate states of a
recurrent neural network document reader. This enables the reader to build
query-specific representations of tokens in the document for accurate answer
selection. The GA Reader obtains state-of-the-art results on three benchmarks
for this task: the CNN & Daily Mail news stories and the Who Did What dataset.
The effectiveness of multiplicative interaction is demonstrated by an ablation
study, and by comparing to alternative compositional operators for implementing
the gated-attention. The code is available at
https://github.com/bdhingra/ga-reader.
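A hedged sketch of a single gated-attention interaction as described above: each document token attends over the query token representations, and the attended query vector gates the token state element-wise. Shapes and names are assumptions; the linked repository contains the authors' implementation.

    import torch

    def gated_attention(doc, query):
        # doc:   (batch, doc_len, hidden)  intermediate document RNN states
        # query: (batch, q_len, hidden)    query token representations
        scores = torch.bmm(doc, query.transpose(1, 2))   # (batch, doc_len, q_len)
        alpha = torch.softmax(scores, dim=-1)            # per-token attention over the query
        attended = torch.bmm(alpha, query)               # (batch, doc_len, hidden)
        return doc * attended                            # multiplicative (gated) interaction

    doc, query = torch.randn(2, 50, 128), torch.randn(2, 10, 128)
    print(gated_attention(doc, query).shape)             # torch.Size([2, 50, 128])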
Embedding Text in Hyperbolic Spaces
Natural language text exhibits hierarchical structure in a variety of
respects. Ideally, we could incorporate our prior knowledge of this
hierarchical structure into unsupervised learning algorithms that work on text
data. Recent work by Nickel & Kiela (2017) proposed using hyperbolic instead of
Euclidean embedding spaces to represent hierarchical data and demonstrated
encouraging results when embedding graphs. In this work, we extend their method
with a re-parameterization technique that allows us to learn hyperbolic
embeddings of arbitrarily parameterized objects. We apply this framework to
learn word and sentence embeddings in hyperbolic space in an unsupervised
manner from text corpora. The resulting embeddings seem to encode certain
intuitive notions of hierarchy, such as word-context frequency and phrase
constituency. However, the implicit continuous hierarchy in the learned
hyperbolic space makes interrogating the model's learned hierarchies more
difficult than for models that learn explicit edges between items. The learned
hyperbolic embeddings show improvements over Euclidean embeddings in some --
but not all -- downstream tasks, suggesting that hierarchical organization is
more useful for some tasks than others.
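A minimal sketch of the re-parameterization idea: keep an ordinary unconstrained Euclidean parameter vector and map it smoothly into the open unit (Poincaré) ball, so standard gradient-based optimizers can be used while the embedding itself lives in hyperbolic space. The exact squashing function below is an illustrative assumption, not necessarily the parameterization used in the paper.

    import torch

    def to_poincare_ball(v: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
        """Map an unconstrained vector to a point with norm strictly below 1."""
        norm = v.norm(dim=-1, keepdim=True)
        direction = v / (norm + eps)
        radius = torch.tanh(norm)             # squashed into (0, 1)
        return (1 - eps) * radius * direction

    v = torch.randn(4, 10, requires_grad=True)   # plain Euclidean parameters
    x = to_poincare_ball(v)                      # differentiable hyperbolic embedding
    print(x.norm(dim=-1))                        # all norms < 1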